Abstract

In silico methods for the prediction of antigenic peptides binding to MHC
class I molecules play an increasingly important role in the identification of T-cell
epitopes. Statistical and machine learning methods, in particular, are widely
used to score candidate epitopes based on their similarity with known epitopes
and non-epitopes. The genes coding for the MHC molecules, however, are highly
polymorphic, and statistical methods have difficulty building models for alleles
with few known epitopes. In this case, recent work has demonstrated the
utility of leveraging information across alleles to improve the performance of the
prediction.

We design a support vector machine algorithm that is able to learn epitope 
models for all alleles simultaneously, by sharing information across similar alleles. 
The sharing of information across alleles is controlled by a user-defined measure 
of similarity between alleles. We show that this similarity can be defined in terms 
of supertypes, or more directly by comparing key residues known to play a role in 
the peptide-MHC binding. We illustrate the potential of this approach on various 
benchmark experiments where it outperforms other state-of-the-art methods.

1 Introduction



A key step in the immune response to pathogen invasion is the activation of cytotoxic
T-cells, which is triggered by the recognition of a short peptide, called an epitope, bound
to Major Histocompatibility Complex (MHC) class I molecules and presented to the
T-cells. This recognition is supposed to trigger cloning and activation of cytotoxic
lymphocytes able to identify and destroy the pathogen or infected cells. MHC class
I epitopes are therefore potential tools for the development of peptide vaccines, in
particular for AIDS vaccines (McMichael and Hanke, 2002). They are also potential
tools for diagnosis and treatment of cancer (Wang, 1999; Sette et al., 2001).

Identifying MHC class I epitopes in a pathogen genome is therefore crucial for vaccine
design. However, not all peptides of a pathogen can bind to the MHC molecule
to be presented to T-cells: it is estimated that only 1 in 100 or 200 peptides actually
binds to a particular MHC (Yewdell and Bennink). In order to alleviate the cost
and time required to identify epitopes experimentally, in silico computational methods
for epitope prediction are therefore increasingly used. Structural approaches, on the
one hand, try to evaluate how well a candidate epitope fits in the binding groove of
an MHC molecule, by various threading or docking approaches (Rosenfeld et al., 1995;
Schueler-Furman et al., 2000; Tong et al., 2006; Bui et al., 2006). Sequence-based
approaches, on the other hand, estimate predictive models for epitopes by analyzing
and learning from sets of known epitopes and non-epitopes. Models can be based on
motifs (Rötzschke et al., 1992; Rammensee et al., 1995), profiles (Parker et al., 1994;
Rammensee et al.; Reche et al., 2002), or machine learning methods like artificial
neural networks (Honeyman et al., 1998; Milik et al., 1998; Brusic et al., 2002;
Buus et al., 2003; Nielsen et al., 2003; Zhang et al., 2005), hidden Markov models
(Mamitsuka, 1998), support vector machines (SVM) (Dönnes and Elofsson, 2002;
Zhao et al., 2003; Bhasin and Raghava, 2004; Salomon and Flower, 2006), boosted
metric learning (Hertz and Yanover, 2006) or logistic regression (Heckerman et al.,
2006). Finally, some authors have recently proposed to combine structural and
sequence-based approaches (Antes et al., 2006; Jojic et al., 2006). Although comparison
is difficult, sequence-based approaches that learn a model from the analysis of known
epitopes benefit from the accumulation of experimentally validated epitopes and will
certainly continue to improve as more data become available.

The binding affinity of a peptide depends on the MHC molecule's 3D structure and
physicochemical properties, which in turn vary between MHC alleles. This compels
any prediction method to be allele-specific: indeed, the fact that a peptide can bind
to an allele is neither sufficient nor necessary for it to bind to another allele. Since
MHC genes are highly polymorphic, little training data, if any, is available for some
alleles. Thus, though achieving good precision in general, classical statistical and
machine learning-based MHC-peptide binding prediction methods fail to efficiently
predict binding for these alleles.

Some alleles, however, can share binding properties. In particular, experimental
work (Sidney et al., 1995, 1996; Sette and Sidney, 1998, 1999) shows that different
alleles have overlapping peptide repertoires. This fact, together with the posterior
observation of structural similarities among the alleles sharing their repertoires, allowed
the definition of HLA allele supertypes, which are families of alleles exhibiting the
same behavior in terms of peptide binding. This suggests that sharing information
about known epitopes across different but similar alleles has the potential to improve
predictive models by increasing the quantity of data used to establish the model. For
example, Zhu et al. (2006) show that simply pooling together known epitopes for
different alleles of a given supertype to train a model can improve the accuracy of the
model. Hertz and Yanover (2006) pool together epitope data for all alleles simultaneously
to learn a metric between peptides, which is then used to build predictive models
for each allele. Finally, Heckerman et al. (2006) show that leveraging the information
across MHC alleles and supertypes considerably improves individual allele prediction
accuracy.

In this paper we show how this strategy of leveraging information across different
alleles when learning allele-specific epitope prediction models can be naturally
performed in the context of SVM, a state-of-the-art machine learning algorithm. This
new formulation is based on the notion of multitask kernels (Evgeniou et al., 2005),
a general framework for solving several related machine learning problems
simultaneously. Known epitopes for a given allele contribute to the model estimation for
all other alleles, with a weight that depends on the similarity between alleles. Here
the notion of similarity between alleles can be very general; we can for example
follow Heckerman et al. (2006) and define two alleles to be similar if they belong to the
same supertype, but the flexibility of our mathematical formulation also allows for
more subtle notions of similarity, based for example on sequence similarity between
alleles. On a benchmark experiment we demonstrate the relevance of the multitask
SVM approach, which outperforms state-of-the-art prediction methods.



2 Methods 

In this section, we explain how information can be shared between alleles when SVM 
models are trained on different alleles. For the sake of clarity we first explain the 
approach in the case of linear classifiers, and then generalize it to more general models. 

2.1 Sharing information with linear classifiers 

Let us first assume that epitopes are represented by $d$-dimensional vectors $x$, and that
for each allele $a$ we want to learn a linear function $f_a(x) = w^\top x$ to discriminate
between epitopes and non-epitopes, where $w \in \mathbb{R}^d$. A natural way to share
information between different alleles is to assume that each vector $w$ is the sum of a
common vector $w_c$ which is common to all alleles, and of an allele-specific vector $w_a$,
resulting in a classifier:

$$f_a(x) = (w_c + w_a)^\top x . \qquad (1)$$

In this equation the first term $w_c$ accounts for general characteristics of epitopes valid
for all alleles, while the second term $w_a$ accounts for allele-specific properties of epitopes.
In order to estimate such a model from data it is convenient to rewrite it as a simple
linear model in a larger space as follows. Assuming that there are $p$ alleles
$\{a_1, \ldots, a_p\}$ we can indeed rewrite (1) as:

$$f_a(x) = W^\top \Phi(x, a) , \qquad (2)$$

where $W$ is the $d \times (p+1)$-dimensional vector
$W = (w_c^\top, w_{a_1}^\top, \ldots, w_{a_p}^\top)^\top$ and
$\Phi(x, a) = (x^\top, 0^\top, \ldots, 0^\top, x^\top, 0^\top, \ldots, 0^\top)^\top \in \mathbb{R}^{d \times (p+1)}$
is the vector obtained by concatenating the vector $x$ with $p$ blocks of zeros, except
for the $a$-th block which is a copy of $x$. Indeed it is then easy to check that
$W^\top \Phi(x, a) = (w_c + w_a)^\top x$, hence that (1) and (2) are equivalent. Each
(peptide, allele) pair is therefore mapped to a large vector $\Phi(x, a)$ with only two
non-zero parts, one common to all alleles and one at an allele-specific position.
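The equivalence between (1) and (2) can be checked numerically. The following is a minimal sketch in plain Python, assuming peptides are already encoded as lists of numbers; the function names `phi` and `dot` and all numerical values are illustrative choices, not part of the original method description.

```python
def phi(x, allele_index, n_alleles):
    """Stacked representation of a (peptide, allele) pair, as in equation (2).

    x is a list of d numbers and allele_index is in 0..n_alleles-1.
    The result has d * (n_alleles + 1) entries: a common copy of x,
    followed by n_alleles blocks that are zero except the allele's own block.
    """
    d = len(x)
    out = list(x)  # common block, shared by all alleles
    for j in range(n_alleles):
        out.extend(x if j == allele_index else [0.0] * d)
    return out


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Toy example with d = 2 and p = 3 alleles (made-up weights).
w_c = [0.5, -1.0]
w_alleles = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
W = w_c + [w for block in w_alleles for w in block]

x = [3.0, 4.0]
a = 1
lhs = dot(W, phi(x, a, 3))                                  # W^T Phi(x, a)
rhs = dot([c + s for c, s in zip(w_c, w_alleles[a])], x)    # (w_c + w_a)^T x
print(lhs, rhs)
```

Both quantities agree because the zero blocks of the stacked vector cancel every allele-specific weight except the one for allele `a`.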

The parameters of this model, namely the weights $w_c$ and $w_a$ for all alleles $a$,
can then be learned simultaneously by any linear model, such as logistic regression or
SVM, that estimates a vector $W$ in (2) from a training set
$((x_1, a_1, y_1), \ldots, (x_n, a_n, y_n))$ of (peptide, allele) pairs labeled as
$y_i = +1$ if peptide $x_i$ is an epitope of allele $a_i$, $y_i = -1$ otherwise. This
approach was followed by Heckerman et al., who included another level of granularity
to describe how information is shared across alleles, by considering allele-specific,
supertype-specific and common weight vectors.

In summary, it is possible to embed the allele information in the description of
the data point to estimate linear models in the new peptide $\times$ allele space to share
information across alleles. It is furthermore possible to adjust how information is
shared by choosing adequate functions $\Phi(x, a)$ to represent (peptide, allele) pairs. In
other words, it is possible to consider the problem of leveraging across the alleles as a
simple choice of representation, or feature design, for the (peptide, allele) pairs that are
to be used to learn the classifier. This approach, however, is limited by at least two
constraints:

• It can be difficult to figure out how to represent the allele information in the
mapping $\Phi(x, a)$. In Heckerman et al. (2006), this is done via Boolean conjunctions
and leads to a convenient form for the prediction functions, like (1) with
a third term accounting for the supertype. Including more prior knowledge
regarding when two alleles should share more information, e.g., based on structural
similarity between alleles, is however not an easy task.

• Practically, injecting new features in the vector $\Phi(x, a)$ increases the dimension of
the space, making statistical estimation, storage, manipulation and optimization
tasks much harder.

In the next subsection we show how both limitations can be overcome by reformulating 
this approach in the framework of kernel methods.

2.2 The kernel point of view



SVM, and more generally kernel methods, only access data through the computation
of inner products between pairs of data points, called a kernel function (Vapnik, 1998;
Schölkopf and Smola, 2002; Schölkopf et al., 2004). As a result, estimating the weights
$W$ with an SVM does not require explicitly computing or storing the vectors $\Phi(x, a)$
for training and test pairs of alleles and peptides. Instead, it only requires the ability
to compute the kernel between any two pairs $(x, a)$ and $(x', a')$ given, in our linear
example, by:

$$K((x, a), (x', a')) = \Phi(x, a)^\top \Phi(x', a') =
\begin{cases}
2\, x^\top x' & \text{if } a = a' , \\
x^\top x' & \text{if } a \neq a' .
\end{cases}$$

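Let us now introduce the following two kernels, respectively between peptides only and
between alleles only:

$$K_{\mathrm{pep}}(x, x') = x^\top x' , \qquad
K_{\mathrm{all}}(a, a') =
\begin{cases}
2 & \text{if } a = a' , \\
1 & \text{if } a \neq a' .
\end{cases}$$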

It is easy to see that both kernels are valid positive definite kernels for peptides and
alleles, respectively. With these notations we see that the kernel for pairs $(x, a)$ can be
expressed as the product of the kernel for alleles and the kernel for peptides:

$$K((x, a), (x', a')) = K_{\mathrm{all}}(a, a')\, K_{\mathrm{pep}}(x, x') , \qquad (3)$$

which is also the kernel associated to the tensor product space of the Hilbert spaces
associated to $K_{\mathrm{pep}}$ and $K_{\mathrm{all}}$ (Aronszajn, 1950). Such kernels are used
in particular in the field of multitask learning (Evgeniou et al., 2005), where several
related machine learning tasks must be solved simultaneously. The allele kernel
$K_{\mathrm{all}}$ quantifies how information is shared between alleles. For example, in the
simple model (1) the kernel is simply equal to 2 if an allele is compared to itself, 1
otherwise, meaning that information is uniformly shared across different alleles.
Alternatively, adding supertype-specific features like Heckerman et al. (2006) would result
in a kernel equal to 3 between an allele and itself, 2 between two different alleles that
belong to a common supertype, and 1 otherwise, resulting in increased sharing of
information within supertypes.
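As an illustration of (3), the sketch below builds the Gram matrix of a few (peptide, allele) pairs twice: once with the product of an allele kernel and a linear peptide kernel, and once through explicit stacked feature vectors. It is a toy check in plain Python; the helper names and the numerical vectors are assumptions made for the example.

```python
def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def k_pep(x, xp):
    """Linear peptide kernel: inner product of the peptide encodings."""
    return dot(x, xp)


def k_all(a, ap):
    """Allele kernel of the simple model (1): 2 for identical alleles, 1 otherwise."""
    return 2.0 if a == ap else 1.0


def k_pair(pair, other):
    """Product kernel (3) between two (peptide, allele) pairs."""
    (x, a), (xp, ap) = pair, other
    return k_all(a, ap) * k_pep(x, xp)


def phi(x, allele_index, n_alleles):
    """Explicit stacked features: common copy of x, then one block per allele."""
    out = list(x)
    for j in range(n_alleles):
        out.extend(x if j == allele_index else [0.0] * len(x))
    return out


# Three toy pairs over two alleles (indices 0 and 1); values are made up.
pairs = [([1.0, 2.0], 0), ([3.0, -1.0], 0), ([3.0, -1.0], 1)]

gram_kernel = [[k_pair(p, q) for q in pairs] for p in pairs]
gram_explicit = [[dot(phi(x, a, 2), phi(xp, ap, 2)) for (xp, ap) in pairs]
                 for (x, a) in pairs]
print(gram_kernel)
```

The kernel route never materializes the stacked vectors, which is the practical point of the reformulation: only pairwise kernel values are needed.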

Interestingly this formulation lends itself particularly well to further generalization.
Indeed, for any positive definite kernels $K_{\mathrm{all}}$ and $K_{\mathrm{pep}}$ for alleles
and peptides, respectively, their product (3) is a valid positive definite kernel over the
product space of pairs (peptide, allele) (Aronszajn, 1950). This suggests a new strategy
to design predictive models for epitopes across alleles, by designing specific kernels for
alleles and peptides, respectively, and combining them to learn all allele-specific models
simultaneously with the tensor product kernel (3). The benefits of this strategy over the
explicit design and computation of feature vectors $\Phi(x, a)$ are twofold. First, it splits
the problem of feature vector design into two subproblems (designing two kernels), each
of which can benefit from previous work on kernel design (e.g., Schölkopf et al., 2004).
For example, the fact that nonlinear kernels such as Gaussian or polynomial kernels for
peptides give good results for SVM trained on individual alleles suggests that they are
natural candidates for the peptide part of the product kernel. Second, working with
kernels alleviates the practical issues due to the potentially large size of the feature
vector representation $\Phi(x, a)$ in terms of memory for storage or speed of convergence
of algorithms. We now describe in more detail the kernels $K_{\mathrm{pep}}$ and
$K_{\mathrm{all}}$ that can be used for peptides and alleles, respectively, to create the
product kernel used in the application.

2.3 Peptide kernels 

We consider in this paper mainly peptides made of 9 amino acids, although extension
to variable-length peptides poses no difficulty in principle (Salomon and Flower, 2006).
The classical way to represent these 9-mers as fixed-length vectors is to encode the
letter at each position by a 20-dimensional binary vector indicating which amino acid
is present, resulting in a 180-dimensional vector representation. In terms of kernel,
the inner product between two peptides in this representation is simply the number
of letters they have in common at the same positions, which we take as our baseline
kernel:

$$K_{\mathrm{linseq}}(x, x') = \sum_{i=1}^{l} \delta(x[i], x'[i]) ,$$

where $l$ is the length of the peptides (9 in our case), $x[i]$ is the $i$-th residue in $x$ and
$\delta(x[i], x'[i])$ is 1 if $x[i] = x'[i]$, 0 otherwise.

Alternatively, several authors have noted that nonlinear variants of the linear kernel
can improve the performance of SVM for epitope prediction (Dönnes and Elofsson,
2002; Zhao et al., 2003; Bhasin and Raghava, 2004). In particular, using a polynomial
kernel of degree $p$ over the baseline kernel is equivalent, in terms of feature space,
to encoding $p$-order interactions between amino acids at different positions. In order
to assess the relevance of such nonlinear extensions we tested a polynomial kernel of
degree 5, i.e.,

$$K_{\mathrm{seq5}}(x, x') = \left( K_{\mathrm{linseq}}(x, x') + 1 \right)^5 .$$
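Both peptide kernels operate directly on amino-acid strings, so the 180-dimensional binary encoding never has to be built. The following plain-Python sketch shows one way to compute them; the function names mirror the kernel names in the text, and the second peptide is a made-up single-residue variant used only for illustration.

```python
def k_linseq(x, xp):
    """Baseline kernel: number of positions where the two peptides share a residue."""
    if len(x) != len(xp):
        raise ValueError("peptides must have the same length")
    return sum(1 for r, rp in zip(x, xp) if r == rp)


def k_seq5(x, xp):
    """Degree-5 polynomial kernel over the baseline kernel."""
    return (k_linseq(x, xp) + 1) ** 5


# Two 9-mers differing at position 3 only (illustrative input).
p1 = "SLYNTVATL"
p2 = "SLFNTVATL"

print(k_linseq(p1, p2), k_seq5(p1, p2))
print(k_linseq(p1, p1), k_seq5(p1, p1))
```

With eight shared positions the baseline kernel equals 8 and the polynomial kernel equals 9 to the power 5; comparing a peptide with itself gives 9 and 10 to the power 5.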

In order to limit the risk of overfitting to the benchmark data we restrict ourselves 
to the evaluation of the baseline linear kernel and its nonlinear polynomial extension. 
Designing a specific peptide kernel for epitope prediction, e.g., by weighting differ- 
ently the positions known to be critical in the MHC-peptide complex, is however an 
interesting research topic that could bring further improvements in the future.

2.4 Allele kernels



Although the question of kernel design for peptides has been raised in previous studies
involving SVM for epitope prediction (Dönnes and Elofsson, 2002; Zhao et al., 2003;
Bhasin and Raghava, 2004; Salomon and Flower, 2006), the question of kernel design
for alleles is new to our knowledge. We tested several choices that correspond to
previously published approaches:

• The Dirac kernel is:

$$K_{\mathrm{Dirac}}(a, a') =
\begin{cases}
1 & \text{if } a = a' , \\
0 & \text{otherwise.}
\end{cases}$$

With the Dirac kernel, no information is shared across alleles and the SVM
learns one model for each allele independently from the others. Therefore this
corresponds to the classical setting of learning epitope prediction models per
allele with SVM.

• The uniform kernel is:

$$K_{\mathrm{uniform}}(a, a') = 1 \quad \text{for all } a, a' .$$

With this kernel all alleles are considered the same, and a unique model is created
by pooling together the data available for all alleles.

• The multitask kernel is:

$$K_{\mathrm{multitask}}(a, a') = K_{\mathrm{Dirac}}(a, a') + K_{\mathrm{uniform}}(a, a') .$$

As explained in the previous section and in Evgeniou et al. (2005), this is the
simplest way to train different but related models. The SVM learns one model
for each allele, using known epitopes and non-epitopes for the allele, but using also
known epitopes and non-epitopes for all other alleles with a smaller contribution.
The training peptides are shared uniformly across different alleles.

• The supertype kernel is:

$$K_{\mathrm{supertype}}(a, a') = K_{\mathrm{multitask}}(a, a') + \delta_s(a, a') ,$$

where $\delta_s(a, a')$ is 1 if $a$ and $a'$ are in the same supertype, 0 otherwise. As explained
in the previous section this scheme trains a specific model for each allele using
training peptides from different alleles, but here the training peptides are shared
more across alleles within a supertype than across alleles in different supertypes.
This is used by Heckerman et al. (2006), without the kernel formulation,
to train a logistic regression model.
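The four allele kernels above translate almost literally into code. The sketch below is a plain-Python illustration; the allele names and the small supertype table are assumptions chosen for the example and do not reproduce the supertype assignments used in the experiments.

```python
def k_dirac(a, b):
    """1 for identical alleles, 0 otherwise: one independent model per allele."""
    return 1.0 if a == b else 0.0


def k_uniform(a, b):
    """Always 1: all alleles are pooled into a single model."""
    return 1.0


def k_multitask(a, b):
    """Sum of the Dirac and uniform kernels: 2 on the diagonal, 1 elsewhere."""
    return k_dirac(a, b) + k_uniform(a, b)


def make_k_supertype(supertype_of):
    """Build the supertype kernel from a dict mapping allele name to supertype."""
    def k_supertype(a, b):
        same = a in supertype_of and supertype_of.get(a) == supertype_of.get(b)
        return k_multitask(a, b) + (1.0 if same else 0.0)
    return k_supertype


# Illustrative supertype table (assumed for this example only).
SUPERTYPE_OF = {"A*0201": "A2", "A*0202": "A2", "B*0702": "B7"}
k_supertype = make_k_supertype(SUPERTYPE_OF)

alleles = sorted(SUPERTYPE_OF)
gram = [[k_supertype(a, b) for b in alleles] for a in alleles]
print(gram)
```

The resulting Gram matrix takes the value 3 between an allele and itself, 2 between distinct alleles of a common supertype, and 1 otherwise, matching the description of the supertype-specific features in Section 2.2.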






Heckerman et al. (2006) show that the supertype kernel generally improves the
performance of logistic regression models compared to the uniform or Dirac kernel. Intuitively
it seems to be an interesting way to include prior knowledge about alleles. However,
one should be careful since the definition of supertypes is based on the comparison of
epitopes of different alleles, which suggests that the supertype information might be
based on some information used to assess the performance of the method in the
benchmark experiment. In order to overcome this issue, and illustrate the possibilities offered
by our formulation, we also tested a kernel between alleles which tries to quantify the
similarity of alleles without using known epitope information. For that purpose we
reasoned that alleles with similar residues at the positions involved in the peptide binding
were more likely to have similar epitopes, and decided to make a kernel between alleles
based on this information. For each locus we gathered from Doytchinova et al. (2004)
the list of positions involved in the binding site of the peptide (Table 1). Taking the
union of these sets of positions we then represented each allele by the list of residues
at these positions, and used a polynomial kernel of degree 7 to compare two lists of
residues associated to two alleles, i.e.,

$$K_{\mathrm{bsite7}}(a, a') = \left( \sum_{i \in \mathrm{bsite}} \delta(a[i], a'[i]) + 1 \right)^7 ,$$

where bsite is the set of residue positions involved in the binding site for one of the three
allele groups HLA-A, B, C, $a[i]$ is the $i$-th residue in $a$ and $\delta(a[i], a'[i])$ is 1 if
$a[i] = a'[i]$, 0 otherwise.
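A minimal sketch of the binding-site kernel follows, assuming each allele is available as an aligned amino-acid string and that positions are 1-based indices into that string. The helper `binding_site_residues`, the short toy sequences and the three positions are hypothetical and only serve to make the computation concrete.

```python
def binding_site_residues(sequence, positions):
    """Extract the residues at the given 1-based positions of an aligned sequence."""
    return "".join(sequence[i - 1] for i in positions)


def k_bsite7(res_a, res_b):
    """Degree-7 polynomial kernel on the number of matching binding-site residues."""
    if len(res_a) != len(res_b):
        raise ValueError("residue lists must have the same length")
    matches = sum(1 for r, rp in zip(res_a, res_b) if r == rp)
    return (matches + 1) ** 7


# Toy aligned sequences and positions (made up for illustration).
seq_a = "GSHSMRYF"
seq_b = "GSHSLRYF"
positions = [1, 5, 7]

res_a = binding_site_residues(seq_a, positions)   # "GMY"
res_b = binding_site_residues(seq_b, positions)   # "GLY"
print(k_bsite7(res_a, res_b), k_bsite7(res_a, res_a))
```

In the setting described in the text, `positions` would be the union of the position sets of Table 1 rather than the three toy indices used here.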



2.5 SVM 



We learn epitope models with SVM, a state-of-the-art algorithm for pattern recognition
(Vapnik, 1998; Schölkopf and Smola, 2002; Schölkopf et al., 2004). We used the
libsvm SVM implementation, with a custom kernel to account for the various kernels
we tested, in the PyML environment (http://pyml.sourceforge.org). Besides the
kernel, SVM depends on one parameter usually called $C$. For each experiment, we
selected the best $C$ among the values $2^i$, $i \in \{-15, -14, \ldots, 9, 10\}$ by selecting the
value leading to the largest area under the ROC curve (AUC) estimated by cross-validation
on the training set only. The performance of each method was then tested on each
experiment by evaluating the AUC over the test data.
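The selection of $C$ can be sketched as a simple grid search. The code below is not the PyML code used for the experiments: `select_c` is a hypothetical helper, and `cv_auc` stands for any user-supplied function that trains an SVM with a given $C$ and returns the cross-validated AUC on the training set. A toy stand-in is used here so that the block runs on its own.

```python
import math

# Candidate values 2^i for i = -15, ..., 10, as described in the text.
C_GRID = [2.0 ** i for i in range(-15, 11)]


def select_c(cv_auc, grid=C_GRID):
    """Return the C from the grid with the largest cross-validated AUC.

    cv_auc is a callable C -> AUC, assumed to use the training set only.
    """
    return max(grid, key=cv_auc)


def toy_cv_auc(c):
    """Stand-in for a real cross-validation run: peaks at C = 1 (made up)."""
    return 1.0 - abs(math.log2(c)) / 100.0


best_c = select_c(toy_cv_auc)
print(len(C_GRID), best_c)
```

Keeping the scoring function as a parameter makes it explicit that the test data never enter the choice of $C$.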



3 Data 

In order to evaluate both the performance of our method and the impact of using 
various kernels for the peptides or the alleles, we test our method on three different 
benchmark datasets that have been compiled recently to compare the performance of 
epitope prediction algorithms.

We first use two datasets compiled by Heckerman et al. (2006), where it is already
shown that leveraging improves prediction accuracy with respect to the best
published results. The first dataset, called SYFPEITHI+LANL, combines experimentally
confirmed positive epitopes from the SYFPEITHI database (see Rammensee et al.,
1999, available at http://www.syfpeithy.de) and from the Los Alamos HIV database
(http://www.hiv.lanl.gov) and negative examples randomly drawn from the HLA
and amino acid distribution in the positive examples, for a total of 3152 data points.
For more details, see Heckerman et al. (2006) where this dataset is used to compare
the leveraged logistic regression with DistBoost. Since this dataset is quite small and
was already used as a benchmark, we use it as a first performance evaluation, and to
compare our kernels.

The second dataset of Heckerman et al. (2006) contains 160,085 peptides including
those from SYFPEITHI+LANL and others from the MHCBN data repository (see
Bhasin et al., 2003, available at http://www.imtech.res.in/raghava/mhcbn/index.html).
This corresponds to 1,585 experimentally validated epitopes, and 158,500 randomly
generated non-binders (100 for each positive). We only kept 50 negatives for each
positive in the interest of time, assuming this would not deteriorate the
performance of our algorithm too much. In the worst case, it is only a handicap for our methods.

Finally, we assess the performance of our method on the MHC-peptide binding
benchmark recently proposed by Peters et al. (2006), who gathered quantitative
peptide-binding affinity measurements for various species, MHC class I alleles and peptide
lengths, which makes it an excellent tool to compare MHC-peptide binding learning
methods. Since our method was first designed for binary classification of HLA
epitopes, we focused on the 9-mer peptides for the 35 human alleles and thresholded at
IC50 = 500. Nevertheless, the application of our method to other species or peptide
lengths would be straightforward, and generalization to quantitative prediction should
not be too problematic either. The benchmark contained 29336 9-mers.
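Turning the quantitative measurements into binary labels is a one-line operation. The sketch below assumes that IC50 values are expressed in nM, that lower values mean stronger binding, and that values strictly below the threshold are labeled as binders; the text only states the threshold of 500, so the unit and the treatment of the boundary are assumptions, and `binarize_ic50` is a hypothetical helper name.

```python
def binarize_ic50(ic50, threshold=500.0):
    """Return +1 for a binder (IC50 below the threshold), -1 otherwise.

    Assumes IC50 in nM and a strict inequality at the boundary.
    """
    return 1 if ic50 < threshold else -1


# Made-up affinity values for illustration.
measurements = [12.0, 499.9, 500.0, 20000.0]
labels = [binarize_ic50(v) for v in measurements]
print(labels)
```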

The first dataset is split into 5 folds and the second into 10 folds, so that testing is
performed only on HIV (LANL) data. The third dataset is split into 5 folds. We used the
same folds as Heckerman et al. (2006), available at
ftp://ftp.research.microsoft.com/users/heckerma/recom
for the first two datasets, and the same folds as Peters et al. (2006), available at
http://mhcbindingpredi
for the third one.

Molecule-based allele kernels require the amino-acid sequences corresponding to
each allele. These sequences are available in various databases, including
http://www.anthonynolan.org and Robinson et al. (2000). We used the peptide-sequence
alignment for the HLA-A, HLA-B and HLA-C loci. Each sequence was restricted to residues
at positions involved in the binding site of one of the three loci, see Table 1. Preliminary
experiments showed that using this restriction instead of the whole sequences did not
change the performance significantly, but it speeds up the calculation of the kernel. We
were not able to find the sequence of a few molecules of the two datasets of Heckerman
et al. (2006), so in the experiments involving these datasets and a molecule-based allele
kernel, we used $K_{\mathrm{bsite7}}(a, a') + K_{\mathrm{multitask}}(a, a')$ instead of simply using
$K_{\mathrm{bsite7}}(a, a')$, with a sentinel value of $K_{\mathrm{bsite7}}(a, a') = 0$ in these cases.
This is the sum of two kernels, so still a positive definite kernel, and actually exactly the
same thing as $K_{\mathrm{supertype}}$ with $K_{\mathrm{bsite7}}$ instead of $\delta_s$.

Locus    Positions

HLA-A    5, 7, 9, 24, 25, 34, 45, 59, 63, 66, 67, 70, 74, 77, 80, 81, 84, 97, 99,
         113, 114, 116, 123, 133, 143, 146, 147, 152, 155, 156, 159, 160, 163,
         167, 171

HLA-B    5, 7, 8, 9, 24, 45, 59, 62, 63, 65, 66, 67, 70, 73, 74, 76, 77, 80, 81, 84,
         95, 97, 99, 114, 116, 123, 143, 146, 147, 152, 155, 156, 159, 160, 163,
         167, 171

HLA-C    5, 7, 9, 22, 59, 62, 64, 66, 67, 69, 70, 73, 74, 77, 80, 81, 84, 95, 97, 99,
         116, 123, 124, 143, 146, 147, 156, 159, 163, 164, 167, 171

Table 1: Residue positions involved in the binding site for the three loci, according
to Doytchinova et al. (2004).
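The fallback for alleles without a known sequence can be sketched as follows. This is an illustrative plain-Python version, assuming the binding-site residues of each allele are stored as a string in a dictionary and that alleles with no sequence are simply absent from it; `make_allele_kernel` and the allele labels are hypothetical.

```python
def make_allele_kernel(site_residues):
    """Build K_bsite7 + K_multitask with a sentinel of 0 for missing sequences.

    site_residues maps an allele name to its string of binding-site residues.
    """
    def kernel(a, b):
        multitask = (1.0 if a == b else 0.0) + 1.0   # Dirac + uniform
        ra, rb = site_residues.get(a), site_residues.get(b)
        if ra is None or rb is None:
            bsite = 0.0                               # sentinel: sequence unknown
        else:
            matches = sum(1 for r, rp in zip(ra, rb) if r == rp)
            bsite = float((matches + 1) ** 7)
        return bsite + multitask
    return kernel


# Toy residue strings; allele "X" has no known sequence (made-up data).
k = make_allele_kernel({"A1": "GMY", "A2": "GLY"})
print(k("A1", "A2"), k("A1", "X"), k("X", "X"))
```

An allele with no sequence therefore falls back to plain multitask sharing, while alleles with known sequences additionally share information according to their binding-site similarity.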



4 Results 

We first use $K_{\mathrm{linseq}}$ and $K_{\mathrm{seq5}}$ for the peptides and
$K_{\mathrm{uniform}}$ (one SVM for all the alleles), $K_{\mathrm{Dirac}}$ (one SVM for each
allele), $K_{\mathrm{multitask}}$, $K_{\mathrm{supertype}}$ and $K_{\mathrm{bsite7}}$ for the
alleles on the small SYFPEITHI+LANL dataset. Using combinations of molecule-based
and non-molecule-based kernels for $K_{\mathrm{all}}$ did not improve the prediction;
generally the result was as good as or slightly worse than the result obtained with the best
of the two combined kernels. Results are displayed in Table 2, and ROC curves for
$K_{\mathrm{linseq}} \times K_{\mathrm{Dirac}}$, $K_{\mathrm{linseq}} \times K_{\mathrm{supertype}}$,
$K_{\mathrm{seq5}} \times K_{\mathrm{supertype}}$ and $K_{\mathrm{seq5}} \times K_{\mathrm{bsite7}}$
in Figure 1.

Table 2 demonstrates the benefits of carefully sharing information across alleles.
The Dirac allele kernel being the baseline kernel corresponding to independent training
of SVM on different alleles, we observe an improvement of at least 2% when
information is shared across alleles during training (with the multitask, supertype or bsite7
strategies). It should be noted, however, that the uniform strategies, which amount
to training a single model for all alleles, perform considerably worse than the Dirac
strategies, justifying the fact that it is still better to build individual models than a
single model for all alleles. Among the strategies to share information across alleles,
the supertype allele kernel seems to work slightly better than the two other ones.
However, one should keep in mind that there is a possible bias in the performance of the
supertype kernel, because some peptides in the test sets might have contributed to the
definition of the allele supertypes. Between the multitask kernel, which considers all
different alleles as equally similar, and the bsite7 kernel, which shares more information
between alleles that have similar residues at key positions, we observe a slight benefit
for the bsite7 kernel, which justifies the idea that including biological knowledge in our
framework is simple and powerful. Finally, we observe that for all allele kernels, the
nonlinear seq5 peptide kernel outperforms the baseline linseq kernel, confirming that
linear models based on position-specific score matrices might be too restrictive a set of
models to predict epitopes accurately.

K_all \ K_pep    linseq           seq5

uniform          0.826 ± 0.010    0.883 ± 0.011
Dirac            0.891 ± 0.014    0.893 ± 0.024
multitask        0.910 ± 0.008    0.936 ± 0.008
supertype        0.923 ± 0.011    0.943 ± 0.015
bsite7           0.919 ± 0.011    0.943 ± 0.009

Table 2: AUC results for an SVM trained on the SYFPEITHI+LANL benchmark with
various kernels, and estimated error on the 5 folds.

In terms of absolute value, all three allele kernels that share information across
alleles combined with the nonlinear seq5 peptide kernel (AUC = 0.943 ± 0.015) strongly
outperform the leveraged logistic regression of Heckerman et al. (2006) (AUC = 0.906 ±
0.016) and the boosted distance metric learning algorithm of Hertz and Yanover (2006)
(AUC = 0.819 ± 0.055). This corresponds to a decrease of roughly 40% of the area
above the ROC curve compared to the better of these two methods. As the boosted
distance metric learning approach was shown to be superior to a variety of other
state-of-the-art methods by Hertz and Yanover (2006), this suggests that our approach can
compete with, if not outperform, the best methods in terms of accuracy.
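The figure of roughly 40% can be recovered from the two AUC values quoted above, reading it as the relative reduction of the area above the ROC curve (one minus the AUC). The short computation below is only a worked check of that arithmetic, with variable names chosen for the example.

```python
auc_leveraged_lr = 0.906   # leveraged logistic regression, as quoted in the text
auc_svm = 0.943            # seq5 peptide kernel with a sharing allele kernel

above_lr = 1.0 - auc_leveraged_lr    # area above the ROC curve, about 0.094
above_svm = 1.0 - auc_svm            # about 0.057

relative_decrease = (above_lr - above_svm) / above_lr
print(round(100 * relative_decrease, 1))
```

The ratio is about 0.39, which is consistent with the "roughly 40%" stated in the paragraph.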

As we can clearly see in Table 2, two factors are involved in the improvement over
the leveraged logistic regression of Heckerman et al. (2006):

• The use of an SVM instead of a logistic regression, since this is the only difference
between the leveraged logistic regression and our SVM with a
$K_{\mathrm{linseq}} \times K_{\mathrm{supertype}}$ kernel. This, however, may not be intrinsic
to the algorithms, but caused by optimization issues for the logistic regression in
high dimension.

• The use of a nonlinear kernel for the peptide, as we observe a clear improvement
in the case of SVM (this improvement might therefore also appear if the logistic
regression was replaced by a kernel logistic regression model with the adequate
kernel).

Figure 1 illustrates the various improvements highlighted by this experiment: first
from the individual SVM ($K_{\mathrm{linseq}} \times K_{\mathrm{Dirac}}$) to the
$K_{\mathrm{linseq}} \times K_{\mathrm{supertype}}$ SVM, which is the SVM equivalent of leveraged
logistic regression, and finally to the $K_{\mathrm{seq5}} \times K_{\mathrm{supertype}}$ and
$K_{\mathrm{seq5}} \times K_{\mathrm{bsite7}}$ SVM, which both give better performance than the
$K_{\mathrm{linseq}} \times K_{\mathrm{supertype}}$ SVM because they use a nonlinear kernel to
compare the peptides. It is also worth noting that the supertype and the bsite7 strategies
give very similar results, which makes them two good strategies to leverage efficiently
across the alleles with different information.

Figure 1: ROC curves on the pooled five folds of the SYFPEITHI+LANL benchmark.



These results are confirmed by the mhcbn+syfpeithi+lanl benchmark, for
which the results are displayed in Table 3. Again, the use of SVM with our product
kernels clearly improves the performance with respect to Heckerman et al. (2006)
(from 0.906 to 0.938). Moreover, we again observe that learning a leveraged predictor
using the data from all the alleles improves the global performance very strongly,
hence the large gap between Dirac (0.867) and all the multitask-based methods,
including the simplest multitask kernel (0.934). It is worth recalling here that the
multitask kernel is nothing but the sum of the Dirac and uniform kernels, i.e., that it
contains no additional biological information: the improvement is caused by the mere
fact of using, roughly with a weight of 0.5, the points of other alleles to learn the
predictor of one allele. Figure 2 shows the ROC curves for SVM with the
K_seq5 x K_Dirac, K_seq5 x K_supertype and K_seq5 x K_bsite7 kernels on this benchmark.
Again, we clearly see the strong improvement between leveraged and non-leveraged
strategies. The difference between K_seq5 x K_Dirac and the two others is only caused
by leveraging, since in all three cases the same nonlinear strategy was used for the
peptide part. On the other hand, the figure illustrates once again that our two
high-level (i.e., more sophisticated than multitask) strategies for leveraging across
alleles give almost the same result.
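
As a minimal sketch of the allele kernels compared here, the following Python code
builds the Dirac, uniform and multitask kernels on allele labels. The function names
are hypothetical and the allele labels are used only as arbitrary identifiers. One way
to read the factor of 0.5 mentioned above is that the multitask kernel takes the value
2 for two points of the same allele and 1 for two points of different alleles.

```python
# Toy illustration of three allele kernels; names are illustrative assumptions.

def k_dirac(a, b):
    """1 if the two alleles are identical, 0 otherwise (no sharing)."""
    return 1.0 if a == b else 0.0

def k_uniform(a, b):
    """Always 1: all alleles are pooled as if they were a single one."""
    return 1.0

def k_multitask(a, b):
    """Sum of the Dirac and uniform kernels."""
    return k_dirac(a, b) + k_uniform(a, b)

def gram(kernel, items):
    """Gram matrix of `kernel` over a list of items."""
    return [[kernel(x, y) for y in items] for x in items]

alleles = ["A_2301", "A_2301", "B_1801"]
K = gram(k_multitask, alleles)

# Weight of a point from another allele relative to a point of the same allele.
relative_weight = k_multitask("A_2301", "B_1801") / k_multitask("A_2301", "A_2301")
```

With the Dirac kernel alone the cross-allele entries would be 0, which corresponds to
training one independent predictor per allele.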



Method                        AUC

Leveraged LR                  0.906
K_linseq x K_supertype        0.916 ± 0.008
K_seq5 x K_Dirac              0.867 ± 0.010
K_seq5 x K_multitask          0.934 ± 0.006
K_seq5 x K_supertype          0.939 ± 0.006
K_seq5 x K_bsite7             0.938 ± 0.006

Table 3: AUC results for an SVM trained on the mhcbn+syfpeithi+lanl benchmark
with various kernels, with the estimated error on the 10 folds.

[Figure 2 plot: ROC curves for seq5 x Dirac, seq5 x supertype and seq5 x bsite7;
horizontal axis: false positive rate]

Figure 2: ROC curves on the pooled ten folds of the mhcbn+syfpeithi+lanl benchmark.

Finally, Table 4 presents the performance on the iedb benchmark proposed in
Peters et al. (2006). The indicated performance corresponds, for each method, to the
average of the AUC over the 35 alleles. This gives an indication of the global
performance of each method. The ANN field is the tool proposed in Peters et al. (2006)
giving the best results on the 9-mer dataset, an artificial neural network proposed in
Nielsen et al. (2003), while the ADT field refers to the adaptive double threading
approach recently proposed in Jojic et al. (2006) and tested on the same benchmark.
These tools were compared to and significantly outperformed other tools in the
comprehensive study of Peters et al. (2006), specifically Peters and Sette (2005) and
Bui et al. (2005), which are both scoring-matrix-based. Our approach gives equivalent
results in terms of global performance to Nielsen et al. (2003), and therefore
outperforms the other internal methods.
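
The averaged AUC used for this comparison can be sketched as follows: compute one AUC
per allele, then take the unweighted mean over alleles. This is only an illustration
under stated assumptions; the function names are hypothetical, the rank-based AUC is a
standard formulation rather than the evaluation code used for the benchmark, and the
scores below are made up.

```python
from collections import defaultdict

def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive has the higher score, counting ties as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def mean_auc_per_allele(records):
    """records: iterable of (allele, label, score) triples.
    Returns the unweighted mean of the per-allele AUCs and the per-allele dict."""
    by_allele = defaultdict(lambda: ([], []))
    for allele, label, score in records:
        by_allele[allele][0].append(label)
        by_allele[allele][1].append(score)
    per_allele = {a: auc(ys, ss) for a, (ys, ss) in by_allele.items()}
    return sum(per_allele.values()) / len(per_allele), per_allele

# Made-up labels and scores for two alleles, purely for illustration.
toy = [
    ("A_2301", 1, 0.9), ("A_2301", 0, 0.2), ("A_2301", 1, 0.4), ("A_2301", 0, 0.5),
    ("B_1801", 1, 0.8), ("B_1801", 0, 0.1),
]
mean_auc, per_allele = mean_auc_per_allele(toy)
```

Because the mean is unweighted, an allele with 59 peptides counts as much as an allele
with several thousand, which is why this summary is sensitive to performance on alleles
with few training points.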

Method                              AUC

SVM with K_seq5 x K_Dirac           0.804
SVM with K_seq5 x K_supertype       0.877
SVM with K_seq5 x K_bsite7          0.892
ADT                                 0.874
ANN                                 0.897

Table 4: AUC results for an SVM trained on the iedb benchmark with various methods.

Table 5 presents the performances on the 10 alleles with fewer than 200 training
points, together with the performances of the best internal tool, the Nielsen et al.
(2003) ANN, and of the adaptive double threading model, which gave good prediction
performances on the alleles with few training data. Except for one case, our SVM
outperforms both models. This means of course that our approach does not perform as
well as Nielsen et al. (2003) on the alleles with a large training set, but nothing
prevents an immunologist from using one tool for some alleles and another tool for
other alleles. As we said in the introduction, our original concern was to improve
binding prediction for alleles with few training points, for which it is hard to
generalize. This was the main point of using a multitask learning approach. The results
on this last benchmark suggest that the leveraging approaches succeed in improving
prediction performances when few training points are available.

Allele    Peptide number    K_seq5 x K_bsite7    ADT      ANN

A_2301    104               0.887 ± 0.021        0.804    0.852
A_2402    197               0.826 ± 0.025        0.785    0.825
A_2902    160               0.948 ± 0.015        0.887    0.935
A_3002    92                0.826 ± 0.048        0.763    0.744
B_1801    118               0.866 ± 0.020        0.869    0.838
B_4002    118               0.796 ± 0.025        0.819    0.754
B_4402    119               0.782 ± 0.084        0.678    0.778
B_4403    119               0.796 ± 0.042        0.624    0.763
B_4501    114               0.889 ± 0.029        0.801    0.862
B_5701    59                0.938 ± 0.046        0.832    0.926

Table 5: Detail of the iedb benchmark for the 10 alleles with fewer than 200 training
points (9-mer data).

5 Discussion and concluding remarks 

In this paper, we introduced a general framework to share efficiently the binding
information available for various alleles by simply defining a kernel for the peptides,
and another one for the alleles. The result is a simple model for MHC-peptide binding
prediction that uses information from the whole dataset to make specific predictions
for any of the alleles. Our approach is simple and general. It is easy to adapt to a
specific problem by using more adequate kernels, and easy to implement by running any
SVM implementation with these kernels. Everything is performed in low dimension
and with no need for feature selection.
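
To make the "one kernel for the peptides, one for the alleles" construction concrete,
here is a minimal sketch of a product kernel on (peptide, allele) pairs. The peptide
kernels below are stand-ins chosen for illustration (a linear kernel on one-hot encoded
peptides, and a fifth-degree polynomial of it), not necessarily the exact definitions
used for K_linseq and K_seq5; the peptide strings are arbitrary and all function names
are hypothetical.

```python
def k_lin_peptide(p, q):
    """Inner product of one-hot encodings, i.e. number of matching positions."""
    if len(p) != len(q):
        raise ValueError("peptides must have the same length")
    return float(sum(a == b for a, b in zip(p, q)))

def k_poly_peptide(p, q, degree=5):
    """Stand-in nonlinear peptide kernel (assumed form, for illustration only)."""
    return (k_lin_peptide(p, q) + 1.0) ** degree

def k_dirac(a, b):
    """Allele kernel without sharing: 1 for identical alleles, else 0."""
    return 1.0 if a == b else 0.0

def k_multitask(a, b):
    """Allele kernel with uniform sharing: Dirac kernel plus constant 1."""
    return 1.0 + k_dirac(a, b)

def product_kernel(k_pep, k_all):
    """Kernel on (peptide, allele) pairs: product of the two kernels."""
    def k(x, y):
        (p, a), (q, b) = x, y
        return k_pep(p, q) * k_all(a, b)
    return k

def gram(kernel, items):
    """Gram matrix of `kernel` over a list of items."""
    return [[kernel(x, y) for y in items] for x in items]

# Arbitrary 9-mer strings paired with allele labels (not real binding data).
pairs = [("ACDEFGHIK", "A_2301"), ("ACDEFGHIK", "B_1801"), ("ACDEFGHIL", "A_2301")]

K_individual = gram(product_kernel(k_lin_peptide, k_dirac), pairs)
K_leveraged = gram(product_kernel(k_lin_peptide, k_multitask), pairs)
K_nonlinear = gram(product_kernel(k_poly_peptide, k_multitask), pairs)
```

A Gram matrix built this way could then be handed to any SVM implementation that
accepts a precomputed kernel (for instance scikit-learn's `SVC(kernel="precomputed")`).
Note how the zero entries of `K_individual` between different alleles become nonzero
in `K_leveraged`.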

We presented performances on three benchmarks. On the first two benchmarks,
our approach performed considerably better than the state of the art, which illustrates
its good general behavior in terms of prediction accuracy. Besides, these experiments
clearly confirmed the interest of leveraging the information across the alleles. On the
last benchmark, the results were globally comparable to the best state-of-the-art
method tested in Peters et al. (2006), with a strong improvement on the alleles for
which few training points were available, probably, as already observed, because
our model uses all the points from all the alleles for each allele-specific prediction.

Another contribution is the use of allele sequences, which allows us to improve the 
prediction accuracy and to do as well as what was done with the supertype information. 
The supertype is a crucial piece of information and a key concept in the development of
epitope-based vaccines, for example to find epitopes that bind several alleles instead of just one.
However, one should be careful when using it to learn an automatic epitope predictor 
because even if the idea behind a supertype definition is to represent a general ligand 
trend, the intuition is always guided by the fact that some alleles have overlapping 
repertoires of known binders, and it is not easy to figure out to which extent the known 
epitopes used to assess the predictor performances were used to design the supertypes. 

Because of these overfitting issues and the fact that supertypes are difficult to define,
the good performance of the molecule-based allele kernel relative to the
supertype-based allele kernels is good news. It potentially allows us to leverage
efficiently across alleles even when the supertype is unknown, which is often the case,
and we do not run the risk of using overfitted information when learning on large
epitope databases.
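
The general idea of a molecule-based allele kernel can be illustrated with a toy
sketch that counts identical residues at a chosen set of positions. This is not the
definition of K_bsite7; the sequences and positions below are invented placeholders
(not real HLA sequences or binding-pocket positions), and the function names are
hypothetical.

```python
def k_residues(seq_a, seq_b, positions):
    """Number of identical residues at the selected 0-based positions.
    Equivalent to a linear kernel on one-hot encodings of those residues."""
    return float(sum(seq_a[i] == seq_b[i] for i in positions))

def allele_gram(seqs, positions):
    """Gram matrix over alleles, in sorted name order."""
    names = sorted(seqs)
    matrix = [[k_residues(seqs[a], seqs[b], positions) for b in names] for a in names]
    return names, matrix

# Invented toy data: artificial strings standing in for allele sequences.
toy_alleles = {
    "allele_1": "AAAAAAAA",
    "allele_2": "AAAACAAA",
    "allele_3": "CCCCACCC",
}
toy_positions = [0, 2, 4, 5, 7]  # hypothetical "key residue" positions

names, G = allele_gram(toy_alleles, toy_positions)
```

Such a kernel is defined for any allele whose sequence is known, which is what makes
it usable when no supertype has been assigned.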

Although the kernels we used already gave good performances, there is still room
for improvement. A first way to improve the performances would be to use more
adequate kernels to compare the peptides and, probably more importantly, to compare
the alleles; in other words, to answer the question: what does it mean, in the context
of MHC-peptide binding prediction, for two alleles to be similar? Possible answers
should probably involve better kernels for the allele sequences, and structural
information, which could be crucial to predict binding and, as we said in the
introduction, is already used in some models. Another interesting possibility is, as
suggested in Hertz and Yanover (2007), the use of true non-binders, which could make
the predictor more accurate than randomly generated peptides since these experimentally
assessed peptides are in general close to the known binders. Finally, it could be
useful to incorporate the quantitative IC50 information when available, instead of
simply thresholding as we did for the last benchmark.
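
The difference between thresholding and keeping the quantitative IC50 value can be
sketched as follows. Both the 500 nM cutoff and the logarithmic rescaling with a
50,000 nM ceiling are common conventions in the MHC binding literature and are
assumptions here, not values taken from this section; the function names are
hypothetical.

```python
import math

def binarize_ic50(ic50_nm, threshold_nm=500.0):
    """Binder (1) if IC50 is at or below the threshold, else non-binder (0).
    The 500 nM default is an assumed, commonly used cutoff."""
    return 1 if ic50_nm <= threshold_nm else 0

def log_transform_ic50(ic50_nm, max_nm=50000.0):
    """Map IC50 to [0, 1] via 1 - log(IC50) / log(max_nm), clipped to [0, 1].
    Stronger binders (lower IC50) get values closer to 1."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    value = 1.0 - math.log(ic50_nm) / math.log(max_nm)
    return min(1.0, max(0.0, value))

# Two peptides on the same side of the cutoff receive the same binary label,
# but keep distinct values under the quantitative transform.
labels = [binarize_ic50(x) for x in (50.0, 400.0, 5000.0)]
targets = [log_transform_ic50(x) for x in (50.0, 400.0, 5000.0)]
```

The continuous targets could serve as regression outputs with the same product
kernels, whereas the binary labels discard the ordering among binders.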

This leads us to the possible generalizations we hope to work on, besides these
improvements. Using the binding affinity information, it is obviously possible to apply
our general framework to predict quantitative values, using regression models with
the same type of kernels. This framework could also be used for many similar
problems involving binding, like MHC class II-peptide binding, where sequences can
have variable length and the alignment of epitopes usually performed as pre-processing
can be ambiguous. Salomon and Flower (2006) already proposed a kernel for this
case. Another interesting application would be drug design, for example
protein-kinase-inhibitor binding prediction, or prediction of a virus's susceptibility
to a panel of drugs for various mutations of the virus.